ABSTRACT

The simulation of subsonic aeroacoustic problems such as the flow-generated sound of wind instruments is 
well suited for parallel computing on a cluster of non-dedicated workstations. Simulations are demonstrated 
which employ 20 non-dedicated Hewlett-Packard workstations (HP9000/715) and achieve performance
comparable to that of a dedicated 64-node CM-5 supercomputer with vector units on this problem. The success
of the present approach depends on the low communication requirements of the problem (small ratio 
of communication to computation) which arise from the coarse-grain decomposition of the problem and 
the use of local-interaction methods. Many important problems may be suitable for this type of parallel 
computing including computer vision, circuit simulation, and other subsonic flow problems. 
1. Introduction

The use of workstations for parallel computing is
a viable and powerful approach. Workstations are
widely available and relatively inexpensive because
the technology is driven by a strong market. A
number of supercomputers have been built using
workstation-type technology combined with a suitable
communication network. At the same time, the
idea of exploiting clusters of workstations for parallel
computing is attracting growing attention.

One of the challenges of exploiting a cluster of
workstations for parallel computing is to design the
computation to match the communication capacity
of the cluster, which is usually limited, as in the
case of a shared-bus Ethernet network. Another
challenge is to exploit clusters of workstations which
are not dedicated; ideally, the workstations should
remain usable by their regular "owners" for text
editing and other small tasks while a parallel
application is running. In this paper, both of the
above issues are discussed.

First, an important class of problems is identified
which is highly suitable for parallel computing on a
cluster of workstations. This is the area of subsonic
computational fluid dynamics (CFD) or simply
subsonic aeroacoustics. Then, a prototype distributed
system is described which includes automatic process
migration, and successfully exploits a cluster
of 25 non-dedicated Hewlett-Packard workstations
(HP9000/715). In terms of performance, 20 non-dedicated
workstations achieve performance on simulations of
aeroacoustics comparable to that of a dedicated
64-node CM-5 supercomputer with vector units.
To demonstrate practical use, the present distributed
system is also applied to a real problem, the
simulation of wind instruments. Performance
measurements of the distributed system as well as
representative simulations of wind instruments are
presented.

2. A suitable class of problems

A workstation cluster can be viewed as a
distributed-memory multiprocessor with low
communication bandwidth and high latency. Accordingly,
numerous small messages between the
processors must be avoided, and a few large messages
are preferred. Further, a computation which
can be decomposed at a coarse-grain level to reduce 
the communication requirements is a better match 
for a workstation cluster than a fine-grain paral- 
lel computation. Suitable problems which possess 
the above characteristics include problems with lo- 
cal interactions and spatial organization. When 
such problems are decomposed into subregions, 
the communication-to-computation ratio is propor- 
tional to the surface-to-volume ratio of the subre- 
gions. Because of locality, only the boundaries 
of the subregions need to be communicated be- 
tween processors. Thus, one can increase the size of 
the subproblems to reduce the communication-to- 
computation ratio, and to match the communication 
capabilities of the cluster. 
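
As a rough illustration of this scaling argument, the following sketch estimates the communication-to-computation ratio of a square or cubic subregion as boundary nodes divided by total nodes. It assumes, as a simplification, that one layer of nodes is sent across each face per step; the function names and the target ratio are hypothetical and are not part of the system described in this paper.

```python
def comm_to_comp_ratio(side, dims=2):
    """Boundary nodes sent per step divided by nodes computed per step.

    Assumes a square (dims=2) or cubic (dims=3) subregion with `side`
    nodes per edge and one layer of nodes exchanged across each face.
    """
    if side < 1 or dims not in (2, 3):
        raise ValueError("side must be >= 1 and dims must be 2 or 3")
    boundary = 2 * dims * side ** (dims - 1)  # surface
    volume = side ** dims                     # volume
    return boundary / volume


def min_side_for_ratio(target, dims=2):
    """Smallest subregion side whose ratio does not exceed `target`."""
    if target <= 0:
        raise ValueError("target must be positive")
    side = 1
    while comm_to_comp_ratio(side, dims) > target:
        side += 1
    return side


# The ratio falls as 1/side, so doubling the side halves the relative overhead.
for side in (25, 50, 100, 200):
    print(side, comm_to_comp_ratio(side))

# Hypothetical target: at most 4 boundary nodes per 100 computed nodes.
print(min_side_for_ratio(0.04))
```

The second function expresses the last sentence of the paragraph above: given a ratio that the network can sustain, one enlarges the subregions until the ratio is met.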

Problems with local interactions and spatial orga- 
nization can be found in computer vision, in cir- 
cuit simulation using waveform relaxation methods 
(reference [1]), in simulations of subsonic aeroa- 
coustics, and possibly other areas. Aeroacoustics 
simulations are the focus of this paper. As we will 
see below, very good results can be achieved on a 
cluster of about 20 workstations linked together by 
a shared-bus Ethernet network. Further, the present 
computations have practical value as they solve real 
problems in subsonic CFD; they are not abstract 
computations which simply demonstrate good per- 
formance. 

Aeroacoustic simulations involve the numerical
solution of a set of partial differential equations
(PDEs). All PDE-based problems employ a numerical
grid (spatial organization) to discretize the equations,
and a numerical method to calculate the future
values of variables defined on the grid. There
are basically two classes of numerical methods for
solving PDEs: explicit methods, which employ local
interactions only, and implicit methods, which
lead to matrix equations and non-local interactions.
Although explicit methods are local and very simple
to program, they are usually avoided because
they require small time steps for numerical stability.

Aeroacoustic simulations are special among other 
PDE-based problems in that they are well suited for 
explicit numerical methods. In particular, the simu- 
lation of subsonic flow and acoustic waves requires 
small time steps to follow accurately the acoustic 
waves step by step. Namely, the time step must 
be comparable to the grid spacing divided by the 
speed of sound, which produces a very small time 
step in the case of subsonic flow. Thus, there is 
a match between the requirements of the problem 
and the requirements of explicit numerical meth- 
ods for small time steps. This match encourages 
the use of explicit methods and makes aeroacoustic 
simulations very suitable for parallel computing on 
a cluster of workstations. 
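
To give a sense of scale, the sketch below evaluates the time step implied by "grid spacing divided by the speed of sound". The speed of sound (343 m/s, air at about 20 °C) and the grid spacing are assumptions chosen only for illustration; the 30 ms duration and the 1104 cm/s blowing speed are the values quoted for the recorder simulation in this paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, air at about 20 °C (assumed value)


def acoustic_time_step(dx, c=SPEED_OF_SOUND):
    """Time step comparable to the grid spacing divided by the speed of sound."""
    return dx / c


def steps_for(duration, dx, c=SPEED_OF_SOUND):
    """Number of integration steps needed to cover `duration` seconds."""
    return math.ceil(duration / acoustic_time_step(dx, c))


dx = 0.25e-3  # hypothetical grid spacing in meters
print("time step [s]:", acoustic_time_step(dx))
print("steps for 30 ms:", steps_for(30e-3, dx))

# Ratio of the blowing speed (11.04 m/s) to the speed of sound: the flow
# is far slower than the acoustic waves that set the time step.
print("Mach number:", 11.04 / SPEED_OF_SOUND)
```

Under these assumptions the time step is below one microsecond, so tens of thousands of steps are needed for a few tens of milliseconds of sound. Since the problem itself demands steps this small, the stability restriction of explicit methods costs little extra.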

3. The distributed system 

It is straightforward to develop a distributed system 
for solving spatially-organized local-interaction 
problems on a cluster of workstations. The present 
distributed system has been developed directly on 
top of UNIX and TCP/IP utilizing also the facilities 
of a clustered HP-UX environment such as file- 
locking semaphores and a common-file-system. 
General programming environments such as PVM 
(reference [2]) have not been used in this work be- 
cause the goal is to experiment with new ideas and 
a prototype system. The present distributed system 
consists of four modules: 

• initialization of the global problem, 

• decomposition into subregions, 

• job submission to free workstations, and 



• job monitoring including the automatic process 
migration from busy hosts to free hosts. 
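
The decomposition module can be pictured with the following minimal sketch, which splits a rectangular grid into a fixed array of rectangular subregions. It is not the implementation of the present system (which must also handle the geometry of the problem); the function names are hypothetical, and pairing the 800x600 grid with a 5x4 decomposition is an assumption made only for illustration.

```python
def split_extent(n, parts):
    """Split n nodes into `parts` contiguous ranges differing by at most one node."""
    base, extra = divmod(n, parts)
    ranges = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose(nx, ny, px, py):
    """Return one ((x0, x1), (y0, y1)) block per workstation, row by row."""
    return [(xr, yr)
            for yr in split_extent(ny, py)
            for xr in split_extent(nx, px)]


blocks = decompose(800, 600, 5, 4)
print(len(blocks), "subregions; first block:", blocks[0])
```

Each block is then submitted as one job to one free workstation, which is the fixed static decomposition used throughout this paper.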

The job submission and job monitoring are per- 
formed by one workstation which can be thought of 
as the "master", while the other workstations are the 
"slaves". The slaves calculate their assigned sub- 
problems independently at every integration step, 
and then communicate the boundaries of their sub- 
problems with their neighbors, and then the cycle 
repeats. The communication step enforces a partial 
synchronization between neighbors. More details 
on the behavior of the system and the implementa- 
tion can be found in reference [3]; here, only the 
main design ideas are outlined. 
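
The compute-then-communicate cycle can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a one-dimensional explicit diffusion update as a stand-in for the actual flow solver, and it keeps all subregions in one process as Python lists instead of exchanging boundaries over TCP/IP. All names are hypothetical.

```python
def step(u, left, right, alpha=0.25):
    """Advance one subregion by one explicit step.

    `left` and `right` are the boundary values received from the two
    neighbors (or the fixed value at the ends of the global domain).
    """
    padded = [left] + u + [right]
    return [
        padded[i] + alpha * (padded[i - 1] - 2.0 * padded[i] + padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def exchange(subs, boundary=0.0):
    """Communication step: one boundary value per neighbor, per subregion."""
    ghosts = []
    for k in range(len(subs)):
        left = subs[k - 1][-1] if k > 0 else boundary
        right = subs[k + 1][0] if k < len(subs) - 1 else boundary
        ghosts.append((left, right))
    return ghosts


def run_decomposed(u, parts, steps):
    """Integrate `u` for `steps` steps using `parts` equal subregions."""
    if len(u) % parts != 0:
        raise ValueError("grid size must be divisible by the number of parts")
    size = len(u) // parts
    subs = [u[k * size:(k + 1) * size] for k in range(parts)]
    ghosts = exchange(subs)
    for _ in range(steps):
        # Computation: every subregion advances independently.
        subs = [step(s, l, r) for s, (l, r) in zip(subs, ghosts)]
        # Communication: neighbors swap their new boundary values.
        ghosts = exchange(subs)
    return [x for s in subs for x in s]


u0 = [0.0] * 16
u0[8] = 1.0
serial = run_decomposed(u0, 1, 50)
parallel = run_decomposed(u0, 4, 50)
print(max(abs(a - b) for a, b in zip(serial, parallel)))
```

Because the update is local, the decomposed run reproduces the single-region run while each subregion only ever sees one value from each neighbor per step.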

The basic ideas which are responsible for the success
of the present distributed system are as follows.
First, the ratio of communication to computation
is small, as mentioned earlier and discussed
again in the next section. Second, the system
utilizes a fixed number of workstations
(fixed static decomposition of the problem).
Third, the system typically utilizes about 4/5 of the
non-dedicated workstations which are available
in the cluster; namely, 20 out of 25.
Keeping the remaining workstations in reserve
simplifies the design, and enables the migration of a
parallel subprocess from a workstation which
becomes busy to a free workstation when
necessary. Other approaches which dynamically vary
the load per workstation and the number of
workstations (for example, the idea of "work stealing"
in Blumofe and Park, reference [4]) are worth
exploring, but they have a disadvantage for the
particular problem at hand. Namely, such approaches
would require a finer decomposition of the problem
into many small tasks to be allocated dynamically,
and this would increase the communication
overhead. By contrast, the present approach of using
large coarse-grain subproblems and allocating
one subproblem per workstation is very simple, has
small overhead, and has produced very good results
in practice.



Regarding migration, the automatic process migra- 
tion in the present system is successful and straight- 
forward because the parallel subprocesses know 
how to handle migration requests. In particular, 
there is a global synchronization signal which is 
used before a migration to instruct all the processes 
to continue running until the start of some integra- 
tion step. When this step is reached, the processes 
that need to migrate save their state on disk and exit, 
while the remaining processes pause and wait for a 
signal to continue the computation. A monitoring 
program finds new workstations to submit the mi- 
grating jobs, and then instructs all the processes to 
continue. An individual process migration is not particularly
fast (it lasts about 20-30 seconds), but migrations
are infrequent. In the present system (20
out of 25 non-dedicated workstations), there is approximately
one migration every 40 minutes on
average, and a typical simulation of aeroacoustics
lasts about 48 hours. Thus, the simple approach 
works well in practice. 
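
The cost of this simple approach can be estimated from the figures just quoted. The sketch below assumes that migrations do not overlap and that the whole computation is paused for the full duration of each migration; the function name is hypothetical.

```python
def migration_overhead(run_hours, minutes_between, seconds_per_migration):
    """Return (number of migrations, fraction of the run spent paused)."""
    migrations = run_hours * 60 / minutes_between
    paused_seconds = migrations * seconds_per_migration
    return migrations, paused_seconds / (run_hours * 3600)


for seconds in (20, 30):
    count, fraction = migration_overhead(48, 40, seconds)
    print(f"{count:.0f} migrations, {100 * fraction:.2f}% of the run paused")
```

Under these assumptions a 48-hour run sees about 72 migrations, and the pauses account for roughly one percent of the running time.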
Figure 1: Parallel efficiency of 2D simulations.

4. Computational performance 

As stated earlier, a cluster of 20 non-dedicated
Hewlett-Packard workstations (HP9000/715) connected
by a shared-bus Ethernet network achieves
performance on simulations of aeroacoustics
comparable to that of a dedicated 64-node CM-5
supercomputer with vector units. The comparison was
made by measuring how many hours it takes to solve
the same problem on the 20 workstations and on the
64-node CM-5. The CM-5 was programmed using the C*
programming language, and the size and the geometry
of the problem (grid of 800x600 fluid nodes) were
fixed and known at compile time; otherwise, the
performance of the CM-5 degrades. The 64-node CM-5
and the 20 workstations achieved roughly the same
performance.

The above result is not surprising because each
HP9000/715 workstation is 3-4 times faster than
an individual processor of the CM-5. Therefore,
if communication is not a bottleneck, the cluster
of 20 workstations has computational power
comparable to that of the 64-node CM-5. Indeed, as we
will see below, communication takes only 20% of
the total running time of a cluster of 20 workstations,
while 80% of the time is spent on computation.
One last comment regarding the comparison
between the cluster and the CM-5 is that the
comparison should not be taken too far: other
problems which have high communication requirements
would not run efficiently on the cluster, but
would run efficiently on a parallel computer such
as the CM-5, which has a powerful communication
network.
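
The argument above amounts to a back-of-the-envelope estimate, sketched below. It uses only the figures quoted in the text (20 workstations, each 3-4 times faster than a CM-5 processor, 80% of the time spent on computation) and assumes, as a simplification, that the CM-5 processors are fully utilized. The function name is hypothetical.

```python
def effective_cm5_nodes(workstations, speed_ratio, compute_fraction):
    """Cluster throughput expressed in units of one CM-5 processor."""
    return workstations * speed_ratio * compute_fraction


low = effective_cm5_nodes(20, 3.0, 0.8)
high = effective_cm5_nodes(20, 4.0, 0.8)
print(f"cluster is worth roughly {low:.0f} to {high:.0f} CM-5 processors")
```

The resulting range of about 48 to 64 processor-equivalents is of the same order as the 64-node CM-5, in line with the measured result.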

Figure 1 shows the efficiency (speedup/processors)
of simulations of subsonic flow as a function of
grain size for 2x2, 3x3, 4x4, and 5x4 decompositions
(triangles, crosses, squares, circles). The horizontal
axis plots the square root of the number of
fluid nodes in the subregion which is assigned to
each workstation. We see that good performance is
achieved in two-dimensional simulations when the
subregion per processor is larger than 100x100 fluid
nodes. This is because the ratio of communication
to computation (the relative overhead) decreases as
the size of the subregions increases. The poor
performance for very small subregions (abrupt drop
in performance) is expected because the Ethernet
network has high latency and a disproportionate
cost for small messages. In these tests, the lattice
Boltzmann numerical method (reference [5]) is
used, which is a recently developed explicit method
for subsonic compressible flow. Similar results are
obtained using traditional finite difference methods.
A detailed description of the numerical methods and
the measurements can be found in reference [3].
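
The disproportionate cost of small messages can be illustrated with the usual linear cost model, in which sending a message takes a fixed latency plus a term proportional to its size. The sketch below is not a measurement of the present network: the latency (1 ms) and bandwidth (10 Mbit/s, i.e. 1.25e6 bytes/s) are hypothetical values chosen only to show the trend.

```python
def message_time(n_bytes, latency_s=1e-3, bandwidth_bytes_per_s=1.25e6):
    """Linear cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def latency_share(n_bytes, latency_s=1e-3, bandwidth_bytes_per_s=1.25e6):
    """Fraction of the message time that is pure latency."""
    return latency_s / message_time(n_bytes, latency_s, bandwidth_bytes_per_s)


for n_bytes in (100, 1_000, 10_000, 100_000):
    print(n_bytes, f"{100 * latency_share(n_bytes):.1f}% latency")
```

In this model the cost of a short message is almost entirely latency, so shrinking the subregions reduces the computation per step much faster than it reduces the communication time.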
Figure 2: Parallel efficiency of 3D simulations.

A limitation of the present simulations of subsonic
flow is that although two-dimensional simulations
perform very well on the shared-bus Ethernet network,
three-dimensional simulations perform rather
poorly. This can be seen in figure 2, which plots
data similar to figure 1 for 3D simulations and for
3D decompositions. The size of the subproblems
per processor is comparable between 3D and 2D
(in number of fluid nodes, the largest size 44x44x44
in 3D is very close to the largest size 300x300 in 2D).
These sizes are dictated by practical considerations,
namely the run time of the computation and the
memory of the workstations. In principle, extremely
large subregions per processor would achieve high
efficiency in 3D, but they are not practical, and they
are not considered here. Instead, our recommendation
for 3D simulations is to improve the communication
network using FDDI, ATM, or simply an Ethernet switch
which provides virtual dedicated connections
between pairs of workstations.
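
A simple node count shows why 3D subregions of comparable size are so much more demanding. The sketch below compares the largest 2D and 3D subregions quoted above, assuming (as a simplification not stated in the text) that each subregion exchanges one layer of nodes across each of its faces. The function name is hypothetical.

```python
def boundary_nodes(side, dims):
    """Nodes in one layer on every face of a square (2D) or cubic (3D) subregion."""
    return 2 * dims * side ** (dims - 1)


for side, dims in ((300, 2), (44, 3)):
    total = side ** dims
    boundary = boundary_nodes(side, dims)
    print(f"{dims}D: {total} nodes, {boundary} boundary nodes, "
          f"ratio {100 * boundary / total:.1f}%")
```

The two subregions hold almost the same number of nodes (90000 and 85184), but the 3D subregion has 11616 boundary nodes against 1200 in 2D, so its communication-to-computation ratio is about ten times larger.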

5. Simulations of wind instruments 

The distributed system has been applied to simulate
directly the flow of air and the generation of tones
in wind instruments using the compressible subsonic
flow equations. Although physical details are
not given here (see references [6] and [7]), a few
representative results of the simulation of a soprano
recorder are shown in figures 3 and 4. In figure 3,
we can see the two-dimensional geometry
of the soprano recorder, and the decomposition into
22 subregions, one per workstation (dashed lines).
This picture is a snapshot of the simulation about
30 ms after the initial blowing of air into the recorder.
The flow of air and the generated vortices are plotted
using iso-vorticity lines. About 0.8 million fluid nodes
are used in this simulation. Figure 4 shows a magnified
view of the jet-labium region, and we can
see the jet of air oscillating at a frequency of about
1100 Hz and generating a musical tone. It is worth
noting that physical measurements of the acoustic
signal generated by the recorder are in satisfactory
agreement with the predictions of the simulations
(reference [7]).

6. Conclusion 

Problems with local interactions and spatial organization
are well suited for parallel computing on
a cluster of workstations. The simulation of subsonic
CFD (aeroacoustics) has been identified as
a particularly good example because the nature of
the problem matches the computing requirements
of the cluster of workstations. Further, a simple
approach of automatic process migration has been
described which allows the exploitation of a cluster
of 25 non-dedicated workstations. The distributed
system has been applied successfully to perform
direct simulations of the aeroacoustics of wind
musical instruments. Apart from aeroacoustic problems,
there are probably many other PDE-based
problems which are suitable for parallel computing
on a cluster of workstations. By combining computer
science with other disciplines, computer
technology can be better matched with the physical
applications.
Figure 3: Simulation of a 20 cm closed-end soprano recorder. Iso-vorticity contours are plotted. The
decomposition is shown as dashed lines, and 22 workstations are used. The gray-shaded areas are not 
simulated. 















Figure 4: Jet oscillations of the 20 cm closed-end recorder at blowing speed 1104 cm/s. Frames are 
0.22 ms apart, from left to right. Iso-vorticity contours are plotted. 35.6 ms after startup. 



